Welcome to: A crash course on using machine learning methods effectively in practice
Some materials and illustrations are based on Chapters 2 and 3 of Mathematical Engineering of Deep Learning.
Data are available at https://github.com/benoit-liquet/MAEDL/
Clustering allows us to identify meaningful groups, or clusters, among the data points and to find representative centers of these clusters.
Samples within each cluster are more closely related to one another than samples from different clusters.
Clustering is the act of associating a cluster label \(\ell\) with each observation, where \(\ell\) comes from a small finite set, \(\{1,\ldots,K\}\).
A clustering algorithm works on the data \(\mathcal{D}\) and outputs a function \(c(\cdot)\) which maps individual data points to the label values \(\{1,\ldots,K\}\).
The K-means algorithm is a very basic, yet powerful, heuristic algorithm.
With K-means, we pre-specify a number \(K\), determining the number of clusters.
\(K\) may be treated as a hyper-parameter.
The algorithm seeks the function \(c(\cdot)\), or alternatively the partition \(C_1,\ldots,C_K\); it also seeks representative centers (also known as centroids) of the clusters, denoted \(J_1,\ldots,J_K\), each an element of \({\mathbb R}^p\).
\[\begin{equation} \label{eq:kMeansObj} \text{Clustering cost} = \sum_{\ell = 1}^K \sum_{x \in C_\ell} || x- J_\ell ||^2. \end{equation}\]
Minimizing this cost exactly is generally computationally intractable since it requires considering all possible partitions of \(\mathcal{D}\) into clusters.
It can, however, be approximately minimized via the K-means algorithm using a classic iterative approach.
The K-means algorithm alternates between two sub-tasks called mean computation and labelling.
Mean computation: given \(c(\cdot)\), or a clustering \(C_1,\ldots,C_K\), the cost-minimizing centers are the cluster means, \[\begin{equation} \label{eq:meanComp} J_\ell = \frac{1}{|C_\ell|} \sum_{x \in C_\ell} x, \qquad \text{for} \qquad \ell= 1,\ldots,K. \end{equation}\]
Labelling: given \(J_1,\ldots,J_K\), \(c(\cdot)\) is defined as \[\begin{equation} \label{eq:labelStep} c(x) = \textrm{argmin}_{\ell \in \{1,\ldots,K\}} ~||x -J_\ell ||. \end{equation}\]
The label of each element is determined by the closest center in Euclidean space.
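The two sub-tasks can be sketched as a minimal K-means loop in base R. This is an illustrative toy on hypothetical two-cluster data; in practice the built-in `kmeans()` is preferable:

```r
# Toy K-means: alternate labelling and mean-computation steps.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # cluster around (0, 0)
           matrix(rnorm(40, mean = 3), ncol = 2))   # cluster around (3, 3)
K <- 2
centers <- x[c(1, nrow(x)), ]                       # crude initial centroids
for (iter in 1:10) {
  # Labelling: assign each point to its nearest centroid (Euclidean)
  d <- as.matrix(dist(rbind(centers, x)))[-(1:K), 1:K]
  labels <- apply(d, 1, which.min)
  # Mean computation: recompute each centroid as its cluster mean
  centers <- t(sapply(1:K, function(l) colMeans(x[labels == l, , drop = FALSE])))
}
table(labels)
```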
library(animation)
set.seed(101)
library(mvtnorm)
x = rbind(rmvnorm(40, mean=c(0,1),sigma = 0.05*diag(2)),rmvnorm(40, mean=c(0.5,0),sigma = 0.05*diag(2)),rmvnorm(40, mean=c(1,1),sigma = 0.05*diag(2)))
par(mfrow=c(3,2))
colnames(x) = c("x1", "x2")
kmeans.ani(x, centers = matrix(c(0.5,1,0.5,0,1,1),byrow=T,ncol=2))
Unsupervised image segmentation via K-means clustering
Each pixel of the image is considered a point in \(\mathcal{D}\) and the dimension of each point is typically \(p=3\) (red, green, and blue) for color images.
Can produce impressive image segmentation without any other information except for the image.
Example: a color image with \(n = 640\times 640=409{,}600\) pixels, each with \(p=3\) attributes.
We run the K-means algorithm, which groups similar pixels based on their attributes and assigns the attributes of the corresponding cluster center to each pixel in the image.
library(ggplot2)
library(jpeg)
img <- readJPEG("Yoni-ben-pool-seg.jpg")
# Obtain the dimension
imgDm <- dim(img)
# Assign RGB channels to data frame
imgRGB <- data.frame(
x = rep(1:imgDm[2], each = imgDm[1]),
y = rep(imgDm[1]:1, imgDm[2]),
R = as.vector(img[,,1]),
G = as.vector(img[,,2]),
B = as.vector(img[,,3])
)
par(mfrow=c(3,1))
# Plot the original image
p1 <- ggplot(data = imgRGB, aes(x = x, y = y)) +
geom_point(colour = rgb(imgRGB[c("R", "G", "B")])) +
labs(title = "Original Image") +
xlab("x") +
ylab("y")
p1
kClusters <- 2
kMeans <- kmeans(imgRGB[, c("R", "G", "B")], centers = kClusters)
kColours <- rgb(kMeans$centers[kMeans$cluster,])
p2 <- ggplot(data = imgRGB, aes(x = x, y = y)) +
geom_point(colour = kColours) +
labs(title = paste("k-Means Clustering of", kClusters, "Colours")) +
xlab("x") +
ylab("y")
p2
kClusters <- 6
kMeans <- kmeans(imgRGB[, c("R", "G", "B")], centers = kClusters)
kColours <- rgb(kMeans$centers[kMeans$cluster,])
p6 <- ggplot(data = imgRGB, aes(x = x, y = y)) +
geom_point(colour = kColours) +
labs(title = paste("k-Means Clustering of", kClusters, "Colours")) +
xlab("x") +
ylab("y")
p6
PCA offers a low-dimensional representation of the features that attempts to capture the most important information, while still retaining most of the variability present in the data.
\[\begin{equation} X=\begin{bmatrix} \vert & &\vert \\ x_{(1)} &\dots &x_{(p)} \\ \vert & & \vert \end{bmatrix} \quad \textrm{with} \quad x_{(i)}=\begin{bmatrix} x_i^{(1)} \\ \vdots \\ x_i^{(n)} \end{bmatrix} \end{equation}\]
\[ \tilde{x}_{(i)} = v_{i,1} \begin{bmatrix} \vert \\ x_{(1)} \\ \vert \end{bmatrix} + v_{i,2} \begin{bmatrix} \vert \\ x_{(2)} \\ \vert \end{bmatrix} + ~ \ldots ~ + v_{i,p} \begin{bmatrix} \vert \\ x_{(p)} \\ \vert \end{bmatrix} \quad \text{for} \quad i=1,\ldots,m, \]
Each new \(n\)-dimensional vector \(\tilde{x}_{(i)}\) is a linear combination of the original features.
Here \(\tilde{x}_{(i)} = X v_i\), where \(v_i = (v_{i,1},\ldots,v_{i,p})\) is called the loading vector for component \(i\).
\[\begin{equation} \label{eq:pca-mat} \underbrace{\begin{bmatrix} \vert & &\vert \\ \tilde{x}_{(1)} & \dots \hspace{-0.3cm}&\tilde{x}_{(m)} \\ \vert & & \vert \end{bmatrix}}_{\underset{\textrm{Reduced data}}{\widetilde{X}_{n\times m}}} = \underbrace{ \begin{bmatrix} \vert & & &\vert \\ x_{(1)} & \dots &\dots &x_{(p)} \\ \vert & && \vert \end{bmatrix} }_{\underset{\text{Original de-meaned data}}{X_{n\times p}}}\times \underbrace{ \begin{bmatrix} \vert & &\vert \\ \vert & &\vert \\ v_1 & \dots \hspace{-0.3cm}&v_m \\ \vert & &\vert \\ \vert & & \vert \end{bmatrix}}_{\underset{\textrm{Matrix of loading vectors}}{\widetilde{V}_{p\times m}}}. \end{equation}\]
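As a sanity check of this matrix form, the sketch below (on random synthetic data, not the workshop data) verifies that the principal components returned by `prcomp()` equal the de-meaned data times the loading matrix:

```r
set.seed(42)
X <- matrix(rnorm(100 * 5), nrow = 100)             # synthetic data, p = 5
pc <- prcomp(X, center = TRUE, scale. = FALSE)
Xc <- scale(X, center = TRUE, scale = FALSE)        # de-meaned data
m <- 2
X_tilde <- Xc %*% pc$rotation[, 1:m]                # reduced data: X times loadings
max(abs(X_tilde - pc$x[, 1:m]))                     # numerically zero
```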
Wisconsin breast cancer data: \(p=30\) and \(n=569\).
To visualize these data using PCA, we set \(m=2\).
[1] 569 32
library(ggplot2)
library(dplyr)
library(FactoMineR)  # provides PCA()
pca <- PCA(data[,-c(1,2)],ncp=2,graph=FALSE)
dat <- data.frame(data,pc1=pca$ind$coord[,1],pc2=pca$ind$coord[,2],diagnosis=as.factor(data[,2]))
#dat <- dat %>% filter(pc1<7 & pc2<10)
p1 <- ggplot(data = dat, aes(x = pc1, y = pc2))+
geom_hline(yintercept = 0, lty = 2) +
geom_vline(xintercept = 0, lty = 2) +
geom_point(alpha = 0.8,size=2.5) + theme_bw()
p1 + theme(axis.text = element_text(size = 20)) + theme(axis.title = element_text(size = 20))
The first two components explain about 63% of the total variance:
       eigenvalue percentage of variance cumulative percentage of variance
comp 1  13.281608                44.27203                          44.27203
comp 2   5.691355                18.97118                          63.24321
p2 <- ggplot(data = dat, aes(x = pc1, y = pc2, color = diagnosis))+
geom_hline(yintercept = 0, lty = 2) +
geom_vline(xintercept = 0, lty = 2) +
geom_point(alpha = 0.8,size=2.5) + theme_bw()+
theme(legend.position=c(0.15,0.85),legend.title=element_blank())
p3 <- p2 + scale_color_discrete( labels = c("benign", "malignant"))
p3 + theme(legend.text=element_text(size=20),axis.text = element_text(size = 20))+ theme(axis.title = element_text(size = 20))
The PCA framework seeks to project the data onto the directions of maximum variance.
Since \(\tilde{x}_{(i)}=Xv_i\), we can formulate this as maximizing the sample variance of the components of \(\tilde{x}_{(i)}\).
Since the data are de-meaned, \(\tilde{x}_{(i)}\) is a zero-mean vector and its sample variance is \(\tilde{x}_{(i)}^\top \tilde{x}_{(i)}/n\).
We have, \[ \text{Sample variance of component}~i = \frac{1}{n} v_i^\top X^\top X v_i = v_i^\top S v_i, \] where \(S\) is the sample covariance of the data.
It turns out that a very useful way to represent the loading vectors \(v_1,\ldots,v_m\) is via normed eigenvectors associated with the eigenvalues of the sample covariance matrix \(S\).
Since \(S\) is symmetric and positive semi-definite, the eigenvalues of \(S\) are real and non-negative, a fact which allows us to order them as \(\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0\).
We then pick the loading vector \(v_i\) to be a normed eigenvector associated with \(\lambda_i\), namely,
\[\begin{equation} \label{eq:eig-pca} S v_i = \lambda_i v_i, \end{equation}\]
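A quick numerical check of this eigen-relation on synthetic data (`eigen()` returns eigenvalues already sorted in decreasing order, matching the ordering above):

```r
set.seed(7)
X <- scale(matrix(rnorm(200 * 4), 200, 4), center = TRUE, scale = FALSE)
S <- crossprod(X) / nrow(X)              # sample covariance X^T X / n
eg <- eigen(S)
v1 <- eg$vectors[, 1]                    # normed eigenvector for lambda_1
max(abs(S %*% v1 - eg$values[1] * v1))   # S v_1 = lambda_1 v_1, up to rounding
```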
\[\begin{equation} \label{eq:reduced-svd} X=U \Delta V^\top=\sum_{i=1}^{r} \delta_{i} \, u_{i} \, v_{i}^\top, \quad \text{with} \quad \Delta=\textrm{diag}(\delta_1,\ldots,\delta_r), \quad \text{and} \quad \delta_i > 0. \end{equation}\]
The \(n \times r\) matrix \(U\) and the \(p \times r\) matrix \(V\) both have orthonormal columns, denoted \(u_i\) and \(v_i\) respectively, for \(i=1,\ldots,r\).
These columns are called the left and right singular vectors, respectively.
The entries \(\delta_i\) of the \(r \times r\) diagonal matrix \(\Delta\) are called singular values and are ordered as \(\delta_1 \geq \delta_2 \geq \cdots \geq \delta_r>0\).
SVD representation of the sample covariance: \[ S = \frac{1}{n} \underbrace{V\Delta^\top U^\top}_{X^{\top}}\underbrace{U\Delta V^\top}_{X} = \frac{1}{n} V \Delta^2 V^\top, \quad \text{with} \quad \Delta^2=\textrm{diag}(\delta_1^2,\ldots,\delta_r^2). \]
Here the fact that \(U\) has orthonormal columns implies \(U^\top U\) is the \(r\times r\) identity matrix and hence it cancels out:
\[\begin{equation} \label{eq:S-svd} S = \sum_{i=1}^r \frac{\delta_i^2}{n} \, v_i \, v_i^\top. \end{equation}\]
Compare to the eigenvector based representation of PCA:
Using the spectral decomposition of \(S\), \[\begin{equation} \label{eq:S-spectral} S = \widetilde{V} \Lambda \widetilde{V}^\top = \sum_{i=1}^r \lambda_i \,v_i \, v_i^\top. \end{equation}\]
Thus, \(\lambda_i=\delta^2_i/n\) and the loading vectors in spectral decomposition are the right singular vectors in SVD: \(\widetilde{V} = V\).
Further, to obtain the data matrix of principal components \(\widetilde{X}\), we set \(\widetilde{X} = X V\). Using the SVD, PCA can be represented as:
\[\begin{equation} \label{eq:pca-svd-relationship} \widetilde{X} = \underbrace{~\, U \Delta V^\top}_{X} V = U \Delta = \begin{bmatrix} \vert & \vert & &\vert \\ \delta_1 u_1 & \delta_2 u_2 \hspace{-0.3cm} & \dots \hspace{-0.3cm}& \delta_r u_r \\ \vert & \vert & & \vert \end{bmatrix}. \end{equation}\]
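This relationship is easy to verify numerically: the scores obtained as \(XV\) coincide with \(U\Delta\) from `svd()` (a sketch on synthetic de-meaned data):

```r
set.seed(3)
X <- scale(matrix(rnorm(150 * 6), 150, 6), center = TRUE, scale = FALSE)
sv <- svd(X)
scores_svd <- sv$u %*% diag(sv$d)  # U Delta
scores_pca <- X %*% sv$v           # X V
max(abs(scores_svd - scores_pca))  # numerically zero
```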
The singular value decomposition can also be viewed as a means for compressing any matrix \(X\).
A rank \(m<r\) approximation of \(X\) is,
\[\begin{equation} \label{eq:svd-approx} \widehat{X} = \sum_{i=1}^{m} \delta_{i} \, u_{i} \, v_{i}^\top \approx X, \qquad \text{where} \qquad X- \widehat{X} = \sum_{i={m+1}}^{r} \delta_{i} \, u_{i} \, v_{i}^\top. \end{equation}\]
The rank of \(\widehat{X}\) is \(m\) and since one often uses \(m\) significantly smaller than \(r\), this is called a low rank approximation.
For small enough \(\delta_{m+1}\) the approximation error is negligible since the summation of rank one matrices \(\delta_{i} \, u_{i} \, v_{i}^\top\) for \(i=m+1,\ldots,r\) is small.
The number of values used in this representation of \(\widehat{X}\) is \(m\times (1+ n + p)\) and for small \(m\) this number is generally much smaller than \(n \times p\) which is the number of values in \(X\).
Hence this may be viewed as a compression method.
We seek to have the best rank \(m\) approximation in terms of minimization of \(\|X-\widehat{X}\|_F\).
The Frobenius norm, denoted \(\| A \|_F\), is the square root of the sum of the squared elements of the matrix \(A\).
The quality of low-rank approximations is established by the Eckart–Young–Mirsky theorem:
\[\begin{equation} \label{eq:ek-yng} \underset{\widehat{X}\text{ of rank }m}{\min}\left\|X - \widehat{X}\right\|^2_F= \left\| X - \sum_{i=1}^ m\delta_i \, u_i v_i^{\top}\right\|^2_F = \sum_{i=m+1}^{r}\delta_i^2. \end{equation}\]
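The identity can be checked directly: the squared Frobenius error of the rank-\(m\) truncation equals the sum of the discarded squared singular values (a sketch on a random matrix):

```r
set.seed(5)
X <- matrix(rnorm(30 * 20), 30, 20)
sv <- svd(X)
m <- 5
Xhat <- sv$u[, 1:m] %*% diag(sv$d[1:m]) %*% t(sv$v[, 1:m])  # rank-m approximation
err <- sum((X - Xhat)^2)                     # squared Frobenius error
tail_sq <- sum(sv$d[(m + 1):length(sv$d)]^2) # discarded singular values, squared
c(err, tail_sq)                              # equal up to rounding
```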
if (!"jpeg" %in% installed.packages()) install.packages("jpeg")
# Read image file into an array with three channels (Red-Green-Blue, RGB)
myImage <- jpeg::readJPEG("CODE_WORKSHOP/pool_graysacle.jpg")
r <- myImage[, , 1]
# Performs full SVD
myImage.r.svd <- svd(r)  # full SVD of the single (grayscale) channel
rgb.svds <- list(myImage.r.svd)
plot.image <- function(pic, main = "") {
h <- dim(pic)[1] ; w <- dim(pic)[2]  # rows are the height, columns the width
plot(x = c(0, w), y = c(0, h), type = "n", xlab = "", ylab = "", main = main)
rasterImage(pic, 0, 0, w, h)
}
compress.image <- function(rgb.svds, nb.comp) {
# nb.comp (number of components) should be less than min(dim(img[,,1])),
# i.e., 170 here
svd.lower.dim <- lapply(rgb.svds, function(i) list(d = i$d[1:nb.comp],
u = i$u[, 1:nb.comp],
v = i$v[, 1:nb.comp]))
img <- sapply(svd.lower.dim, function(i) {
img.compressed <- i$u %*% diag(i$d) %*% t(i$v)
}, simplify = 'array')
img[img < 0] <- 0
img[img > 1] <- 1
return(list(img = img, svd.reduced = svd.lower.dim))
}
par(mfrow = c(2, 2))
plot.image(r, "Original image")
p <- 10 ; plot.image(compress.image(rgb.svds, p)$img[,,1],
paste("SVD with", p, "components"))
p <- 30 ; plot.image(compress.image(rgb.svds, p)$img[,,1],
paste("SVD with", p, "components"))
p <- 50 ; plot.image(compress.image(rgb.svds, p)$img[,,1],
paste("SVD with", p, "components"))
The input \(x \in \Re^p\) is transformed into a bottleneck, also called the code, which is some \(\tilde{x} \in \Re^m\) and forms the hidden layer of the model.
Then the bottleneck is further transformed into the output \(\hat{x} \in \Re^p\).
The part of the autoencoder that transforms the input into the bottleneck is called the encoder and the part of the autoencoder that transforms the bottleneck to the output is called the decoder. Both the encoder and the decoder have parameters that are to be learned.
Interestingly, once the parameters are trained, we generally expect the autoencoder to generate an output \(\hat{x}\) that is as similar to the input \(x\) as possible.
Consider the activity of data reduction, where the dimension of the bottleneck, \(m\), is significantly smaller than the input and output dimension \(p\).
If a trained autoencoder yields \(x \approx \hat{x}\), then we have an immediate data reduction method.
With the trained encoder we are able to convert digit images, each of size \(28 \times 28 = 784\), into much smaller vectors, each of size \(30\).
With the trained decoder we are able to convert back and get an approximation of the original image. This choice of \(m\) implies a rather remarkable compression factor of about \(26\).
\[\begin{equation} \label{eq:autoencoder-loss} C_i(\theta) = \| x^{(i)} - f_\theta(x^{(i)}) \|^2 \quad \text{and thus} \quad C(\theta \, ; \, \mathcal D) = \frac{1}{n} \sum_{i=1}^n \| x^{(i)} - f_\theta(x^{(i)}) \|^2. \end{equation}\]
With this loss structure, learning the parameters \(\theta\) of an autoencoder based on data \(\mathcal D\) is the process of minimizing \(C(\theta \, ; \, \mathcal D)\).
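The loss above is just the mean squared reconstruction error; a minimal helper makes this concrete (`f` here is a hypothetical stand-in for the trained map \(f_\theta\)):

```r
# Mean squared reconstruction error over a data set X (rows are observations)
reconstruction_cost <- function(X, f) {
  mean(rowSums((X - t(apply(X, 1, f)))^2))
}
set.seed(9)
X <- matrix(rnorm(20 * 4), 20, 4)
reconstruction_cost(X, function(x) x)      # perfect reconstruction: cost 0
```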
\[ \hat{x} = f_\theta(x) = \big(f^{[2]}_{\theta^{[2]}} \circ f^{[1]}_{\theta^{[1]}}\big) (x) = f^{[2]}_{\theta^{[2]}}\big( f^{[1]}_{\theta^{[1]}}(x) \big), \]
\[\begin{equation} \label{eq:encoder-decoder} \begin{array}{rcl} f^{[1]}_{\theta^{[1]}}(u)&=&S^{[1]}(b^{[1]} + W^{[1]} u) \quad \text{for} \quad u \in \Re^p \qquad (\textrm{Encoder})\\ f^{[2]}_{\theta^{[2]}}(u)&=&S^{[2]}(b^{[2]} + W^{[2]} u) \quad \text{for} \quad u \in \Re^m \qquad (\textrm{Decoder}), \end{array} \end{equation}\]
The encoder parameters \(\theta^{[1]}\) are composed of the bias \(b^{[1]}\in \Re^{m}\) and weight matrix \(W^{[1]}\in \Re^{m \times p}\)
The decoder parameters \(\theta^{[2]}\) are composed of the bias \(b^{[2]}\in \Re^{p}\) and weight matrix \(W^{[2]}\in \Re^{p \times m}\).
\[\begin{equation} \label{eq:compete-2layer-auto-theta} \theta = (b^{[1]},W^{[1]},b^{[2]},W^{[2]}). \end{equation}\]
Vector activation functions \(S^{[1]}(\cdot)\) and \(S^{[2]}(\cdot)\): we construct these based on scalar activation functions \(\sigma^{[\ell]}: \Re \to \Re\) for \(\ell=1,2\).
Specifically, we set \(S^{[\ell]}(z)\) to be the element wise application of \(\sigma^{[\ell]}(\cdot)\) on each of the coordinates of \(z\)
\[\begin{equation} \label{eqn:activation-function-vector} S^{[\ell]}(z)=\left[ \begin{matrix} \sigma^{[\ell]}\left(z_{1}\right) \\ \vdots\\ \sigma^{[\ell]}\left(z_{r}\right) \end{matrix} \right]. \end{equation}\]
\[\begin{equation} \label{eq:costautoencoder} C(\theta \, ; \,\mathcal D)=\frac{1}{n} \sum_{i=1}^n\ \|x^{(i)}-\underbrace{S^{[2]}(W^{[2]}S^{[1]}(W^{[1]}x^{(i)}+b^{[1]})+b^{[2]})}_{f_{\theta}(x^{(i)})}\|^2. \end{equation}\]
Autoencoders generalize principal component analysis (PCA)
PCA is essentially a shallow autoencoder with identity activation functions \(\sigma^{[\ell]}(u) = u\) for \(\ell=1,2\), also known as a linear autoencoder.
PCA yields one possible solution to the learning optimization problem for linear autoencoders.
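To illustrate, the sketch below builds the PCA solution as explicit linear encoder/decoder weights (zero biases, since the data are centered) and checks that its reconstruction error equals the sum of the discarded eigenvalues of \(S\); this is one possible minimizer, as noted above:

```r
set.seed(11)
X <- scale(matrix(rnorm(100 * 6), 100, 6), center = TRUE, scale = FALSE)
V <- prcomp(X)$rotation
m <- 2
W1 <- t(V[, 1:m])                 # encoder weights: code = W1 x
W2 <- V[, 1:m]                    # decoder weights: xhat = W2 code
Xhat <- X %*% t(W1) %*% t(W2)     # row-wise application of W2 W1
cost <- mean(rowSums((X - Xhat)^2))
S <- crossprod(X) / nrow(X)
cost - sum(eigen(S)$values[(m + 1):6])   # numerically zero
```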
Consider using an autoencoder on MNIST where \(p=28\times28 = 784\) and we use \(m=2\).
We encode this via PCA, a shallow non-linear autoencoder, and a deep autoencoder that has hidden layers.
The autoencoders are trained on the training set and the codes presented are both for the training set, and for the testing set data.
We color the code points based on the labels. This allows us to see how different labels are generally encoded onto different regions of the code space.
One application of such data reduction is to help separate the data.
It is evident that as model complexity increases, better separation occurs in the data.
In terms of reconstruction, it is also evident in this case that more complex models exhibit better reconstruction ability.
This model learns to remove noise during the reconstruction step for noisy input data.
It takes in partially corrupted input and learns to recover the original denoised input.
It relies on the hypothesis that high-level representations are relatively stable and robust to input corruption, and that the model is able to extract characteristics that are useful for representing the input distribution.
Take \(x^{(i)}\) and \(x^{(j)}\) from \(\mathcal D = \{x^{(1)},\ldots,x^{(n)}\}\) and consider the convex combination \[ x_\lambda^{\text{naive}} = \lambda x^{(i)} + (1-\lambda) x^{(j)}, \] for some \(\lambda \in [0,1]\).
\(x_\lambda^{\text{naive}}\) is a weighted average between the two observations.
\(\lambda\) captures which of the observations has more weight.
Such arithmetic on the associated feature vectors is too naive and often meaningless.
When considering the latent space representation of the images it is often possible to create a much more meaningful interpolation between the images.
Train an autoencoder and then encode \(x^{(i)}\) and \(x^{(j)}\) to obtain \(\tilde{x}^{(i)}\) and \(\tilde{x}^{(j)}\).
Then interpolate on the codes, and finally decode \(\tilde{x}_\lambda\) to obtain an interpolated image.
\[ {x}_\lambda^{\text{encoder}} = f^{[2]} \Big(\lambda f^{[1]}({x}^{(i)}) + (1-\lambda) f^{[1]}({x}^{(j)}) \Big). \]
The following code will take some time to run.
# To install the R package, download the tar.gz file at https://cran.r-project.org/src/contrib/Archive/ruta/
library(ruta)
library(rARPACK)
library(ggplot2)
###############
### Function plot
###############
plot_digit <- function(digit, ...) {
image(keras::array_reshape(digit, c(28, 28), "F")[, 28:1], xaxt = "n", yaxt = "n", col=gray(1:256 / 256), ...)
}
plot_sample <- function(digits_test, model1,model2,model3, sample) {
sample_size <- length(sample)
layout(
matrix(c(1:sample_size, (sample_size + 1):(4 * sample_size)), byrow = F, nrow = 4)
)
for (i in sample) {
par(mar = c(0,0,0,0) + 1)
plot_digit(digits_test[i, ])
plot_digit(model1[i, ])
plot_digit(model2[i, ])
plot_digit(model3[i, ])
}
}
#######################
#### Load MNIST DATA
#######################
mnist = keras::dataset_mnist()
# Normalization to the [0, 1] interval
x_train <- keras::array_reshape(
mnist$train$x, c(dim(mnist$train$x)[1], 784)
)
x_train <- x_train / 255.0
x_test <- keras::array_reshape(
mnist$test$x, c(dim(mnist$test$x)[1], 784)
)
x_test <- x_test / 255.0
if(T){
network <- input() + dense(30, "tanh") + output("sigmoid")
network1 <- input() + dense(50, "tanh") +dense(10, "linear")+dense(50, "tanh") +output("sigmoid")
}
### model simple
network.simple <- autoencoder(network)#, loss = "binary_crossentropy")
model = train(network.simple, x_train, epochs = 10)
decoded.simple <- reconstruct(model, x_test)
### model deep
my_ae2 <- autoencoder(network1)#, loss = "binary_crossentropy")
model2 = train(my_ae2, x_train, epochs = 10)
decoded2 <- reconstruct(model2, x_test)
#### Linear interpolation between two digits
digit_A = x_train[which(mnist$train$y==3)[1],]#MNIST digit with 3 (This is the first digit in the train set that has 3)
digit_B = x_train[which(mnist$train$y==3)[10],]#another MNIST digit with 3 (This is the 10[th] digit in the train set that has 3)
latent_A = encode(model2,matrix(digit_A,nrow=1))
latent_B = encode(model2,matrix(digit_B,nrow=1))
lambda = 0.5
latent_interpolation = lambda*latent_A + (1-lambda)*latent_B
naive_interpolation = lambda*digit_A + (1-lambda)*digit_B
output_interpolation = decode(model2,latent_interpolation)
par(mar = c(0,0,0,0) + 1,mfrow=c(1,3))
plot_digit(digit_A)
plot_digit(as.vector(output_interpolation))
plot_digit(digit_B)
par(mar = c(0,0,0,0) + 1,mfrow=c(1,3))
plot_digit(digit_A)
plot_digit(as.vector(naive_interpolation))
plot_digit(digit_B)
In this challenge, we will use an unsupervised method (principal component analysis) in combination with a supervised method (a sigmoid model, i.e., logistic regression) to perform a classification task. This approach reduces data dimensionality to retain the most important information before training a classifier. We will work on the breast cancer dataset. The main steps:
1. Data preparation
- Load the dataset and split it into training and testing sets.
- Separate the feature columns from the target variable in both sets.
2. Dimensionality reduction with PCA
- Standardize the feature data, then apply PCA to reduce dimensionality.
- Transform both the training and test sets using the trained PCA model.
3. Model training
- Train logistic regression models on the PCA-transformed training data.
- Use different numbers of principal components.
4. Model evaluation
- Evaluate each model on the PCA-transformed test set using accuracy.
- Provide confusion matrices to determine the optimal number of components.
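The steps above can be sketched end to end. The data here are a synthetic stand-in (the actual breast cancer file and its column layout are not assumed), so treat this as a template to adapt:

```r
set.seed(2024)
n <- 300; p <- 10
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.5)         # correlated features, so PCA is informative
y <- rbinom(n, 1, plogis(2 * X[, 1]))         # synthetic binary target
idx <- sample(n, 0.7 * n)                     # 1. train/test split
Xtr <- scale(X[idx, ])                        # 2. standardize using training data
Xte <- scale(X[-idx, ], center = attr(Xtr, "scaled:center"),
             scale = attr(Xtr, "scaled:scale"))
pc <- prcomp(Xtr)                             #    PCA fitted on the training set
m <- 3                                        #    number of components to keep
Ztr <- Xtr %*% pc$rotation[, 1:m]             #    project both sets on m components
Zte <- Xte %*% pc$rotation[, 1:m]
fit <- glm(y[idx] ~ Ztr, family = binomial)   # 3. logistic (sigmoid) model
pred <- as.numeric(cbind(1, Zte) %*% coef(fit) > 0)
mean(pred == y[-idx])                         # 4. test-set accuracy
table(predicted = pred, truth = y[-idx])      #    confusion matrix
```

Repeating the fit for several values of `m` and comparing the resulting accuracies and confusion matrices identifies the optimal number of components.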